    SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

    This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.

    UniMorph 4.0: Universal Morphology

    The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

    Morphological inflectional rules for Karelian Proper verbs

    A methodology for the development and implementation of inflectional rules for verbs in Karelian Proper is presented. The materials for this study were lemmas and word forms from the Open corpus of Veps and Karelian languages (VepKar) and the electronic version of the Karelian language Dictionary. The system of rules for automatic verb inflection for the Karelian Proper supradialect is of both practical and theoretical scientific interest. The new rules have already enabled entering 141 000 Karelian Proper word forms in the VepKar dictionary. The new program for word form generation has significantly reduced the time for adding the full inflectional paradigm of any Karelian Proper verb to the VepKar dictionary. One only needs to fill in several template parameters instead of 125 word forms.
    Summary (translated from Estonian). Natalia Krizhanovskaya, Irina Novak, Andrew Krizhanovsky, Nataliya Pellinen: Morphological inflectional rules for Karelian Proper verbs. The article presents the methodology used to develop and implement inflectional rules for Karelian Proper verbs. The material consisted of lemmas and word forms collected from the Open Corpus of Veps and Karelian Languages (VepKar) and from the electronic version of the Karelian language dictionary. For the first time, a system of rules was developed for the automatic generation of Karelian Proper verb forms. It is of scientific interest from both a practical and a theoretical standpoint. The new rules have already made it possible to add 141 000 Karelian Proper word forms to the VepKar dictionary. The new word-form generation program has significantly reduced the time needed to add the full inflectional paradigm of a Karelian Proper verb to the VepKar dictionary. Instead of entering 125 word forms, one only needs to fill in templates with a few parameters.

    The Open Corpus of the Veps and Karelian Languages: Overview and Applications

    Abstract. A growing priority in the study of Baltic-Finnic languages of the Republic of Karelia has been the methods and tools of corpus linguistics. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working with the Open Corpus of the Veps and Karelian Languages (VepKar), which is an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search using various criteria of the texts (language, genre, etc.) and numerous linguistic categories (lexical and grammatical search in texts was implemented thanks to the generator of word forms that we created earlier). A corpus of 3000 texts was compiled, texts were uploaded and marked up, the system for classifying texts into languages, dialects, types and genres was introduced, and the word-form generator was created. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs. Owing to continuous functional advancements in the corpus manager and ongoing VepKar enrichment with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and exhibit as fully as possible the state of the Veps and Karelian languages in the 1